Simultaneous Stereo Video Deblurring and Scene Flow Estimation
Videos of outdoor scenes often show unpleasant blur effects due to the large
relative motion between the camera and the dynamic objects and large depth
variations. Existing works typically focus on monocular video deblurring. In this
paper, we propose a novel approach to deblurring from stereo videos. In
particular, we exploit the piece-wise planar assumption about the scene and
leverage the scene flow information to deblur the image. Unlike the existing
approach [31] which used a pre-computed scene flow, we propose a single
framework to jointly estimate the scene flow and deblur the image, where the
motion cues from scene flow estimation and blur information reinforce
each other and produce results superior to those of conventional scene flow
estimation or stereo deblurring methods. We evaluate our method extensively on
two available datasets and achieve significant improvements in flow estimation
and blur removal over the state-of-the-art methods.
Comment: Accepted to IEEE International Conference on Computer Vision and
Pattern Recognition (CVPR) 201
Event Camera Data Pre-training
This paper proposes a pre-trained neural network for handling event camera
data. Our model is a self-supervised learning framework, and uses paired event
camera data and natural RGB images for training.
Our method contains three modules connected in a sequence: i) a family of
event data augmentations, generating meaningful event images for
self-supervised training; ii) a conditional masking strategy to sample
informative event patches from event images, encouraging our model to capture
the spatial layout of a scene and accelerating training; iii) a contrastive
learning approach, enforcing the similarity of embeddings between matching
event images, and between paired event and RGB images. An embedding projection
loss is proposed to avoid model collapse when enforcing the event image
embedding similarities. A probability distribution alignment loss is proposed
to encourage the event image to be consistent with its paired RGB image in the
feature space.
Transfer learning performance on downstream tasks shows the superiority of
our method over state-of-the-art methods. For example, we achieve a top-1
accuracy of 64.83% on the N-ImageNet dataset
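The contrastive objective in item iii) can be illustrated with a generic InfoNCE-style loss between paired event and RGB embeddings. This is a standard stand-in, not the paper's loss: the embedding projection loss and the probability distribution alignment loss described above are not reproduced, and the function name, the temperature value, and the array shapes are assumptions made for illustration.

```python
import numpy as np

def info_nce(event_emb, rgb_emb, temperature=0.07):
    """Generic InfoNCE loss (illustrative, not the paper's exact objective).

    event_emb, rgb_emb: (N, D) arrays; row i of each is assumed to come
    from the same scene, so the diagonal pairs are the positives.
    """
    # L2-normalize so the dot product is a cosine similarity.
    e = event_emb / np.linalg.norm(event_emb, axis=1, keepdims=True)
    r = rgb_emb / np.linalg.norm(rgb_emb, axis=1, keepdims=True)
    logits = e @ r.T / temperature
    # Subtract the row maximum for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Average negative log-likelihood of the matching (diagonal) pairs.
    return -np.mean(np.diag(log_prob))
```

The loss is small when each event embedding is closest to its own paired RGB embedding and grows when the pairing is broken.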
Bringing Blurry Images Alive: High-Quality Image Restoration and Video Reconstruction
Consumer-level cameras are affordable, handy, and easy to use, but the images and videos they capture are likely to suffer from motion blur, especially under low-light conditions. Moreover, it is difficult to capture high frame-rate videos due to the hardware limitations of conventional RGB sensors. Therefore, this thesis focuses on restoring high-quality (sharp, high frame-rate) images and videos from low-quality (blurry, low frame-rate) ones for better practical applications. We mainly address the problem of how to restore a sharp image from a blurred stereo video sequence, a blurred RGB-D image, or a single blurred image. Then, by utilizing the faithful information about the motion provided by the blur in the image, we reconstruct high frame-rate, sharp videos based on an event camera, bringing blurry frames alive.
Stereo camera systems provide motion information that helps remove complex spatially-varying motion blur in dynamic scenes. Given consecutive blurred stereo video frames, we recover the latent images, estimate the 3D scene flow, and segment the multiple moving objects simultaneously. We represent the dynamic scenes with the piece-wise planar model, which exploits the local structure of the scene and expresses various dynamic scenes. These three tasks are naturally connected under our model and expressed as the parameter estimation of 3D scene structure and camera motion (structure and motion for the dynamic scenes).
To tackle the challenging, minimal image deblurring case, namely, single-image deblurring, we first focus on blur caused by camera shake during the exposure time. We propose to jointly estimate the 6 DoF camera motion and remove the non-uniform blur by exploiting their underlying geometric relationships, with a single blurred RGB-D image as input. We formulate our joint deblurring and 6 DoF camera motion estimation as an energy minimization problem solved in an alternating manner.
In general cases, we solve the single-image deblurring task by studying the problem in the frequency domain. We show that the auto-correlation of the absolute phase-only image (phase-only image means the image is reconstructed only from the phase information of the blurry image) can provide faithful information about the motion (e.g., the motion direction and magnitude) that caused the blur, leading to a new and efficient blur kernel estimation approach.
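As a rough illustration of the quantities named above, the following sketch computes a phase-only image and the autocorrelation of its absolute value with NumPy FFTs. It is a sketch under stated assumptions rather than the thesis's kernel estimation method: the function names, the small constant eps, the mean subtraction, and the use of circular (FFT-based) autocorrelation are all choices made here for illustration.

```python
import numpy as np

def phase_only_image(img, eps=1e-8):
    """Reconstruct an image from the Fourier phase only (unit magnitude)."""
    F = np.fft.fft2(img)
    phase = F / (np.abs(F) + eps)  # keep the phase, discard the magnitude
    return np.real(np.fft.ifft2(phase))

def autocorrelation(x):
    """Circular autocorrelation via the Wiener-Khinchin relation."""
    X = np.fft.fft2(x)
    ac = np.real(np.fft.ifft2(np.abs(X) ** 2))
    # Move the zero-lag term to the center of the array.
    return np.fft.fftshift(ac)

def phase_only_autocorrelation(img):
    """Autocorrelation of the absolute phase-only image (illustrative)."""
    p = np.abs(phase_only_image(img))
    p = p - p.mean()  # assumption: remove the mean so the DC term does not dominate
    return autocorrelation(p)
```

The zero-lag peak sits at the center of the returned map; according to the text, it is the structure away from that peak that carries information about the motion direction and magnitude.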
Event cameras are gaining attention because they measure intensity changes (called 'events') with microsecond accuracy. The event camera also allows the simultaneous output of intensity frames. However, these images are captured at a relatively low frame-rate and often suffer from motion blur. A blurred image can be regarded as the integral of a sequence of latent images, while the events indicate the changes between the latent images. Therefore, we model the blur-generation process by associating event data to a latent image. We propose a simple and effective approach, the EDI model, to reconstruct a high frame-rate, sharp video (>1000 fps) from a single blurry frame and its event data. The video generation is based on solving a simple non-convex optimization problem in a single scalar variable.
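The blur-generation model described above can be sketched numerically. The following is a minimal illustration, not the thesis implementation: it assumes the common event-camera model in which the log intensity at a pixel changes by a contrast threshold c per event, so that each latent frame equals the reference latent frame multiplied by exp(c * E), where E is the signed event count accumulated since the reference time. The function names, the array layout, and treating c as known are assumptions; the scalar optimization mentioned in the text is not reproduced.

```python
import numpy as np

def edi_latent_frame(blurred, event_sums, c, eps=1e-8):
    """Recover the latent frame at the reference time (illustrative sketch).

    blurred:    (H, W) blurry frame, modeled as the mean of the latent frames
                over the exposure.
    event_sums: (N, H, W) signed event counts accumulated from the reference
                time to each of N sample times within the exposure.
    c:          assumed contrast threshold (scalar), taken as given here.
    """
    # Discretized average of exp(c * E(t)) over the exposure time.
    gain = np.exp(c * event_sums).mean(axis=0)
    return blurred / (gain + eps)

def edi_video(blurred, event_sums, c):
    """Generate one sharp frame per sample time from a single blurry frame."""
    ref = edi_latent_frame(blurred, event_sums, c)
    return ref[None] * np.exp(c * event_sums)
```

Because every output frame is the reference frame scaled by the accumulated events, the achievable frame rate is limited by the temporal resolution of the events rather than by the frame rate of the intensity camera.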
Then, we improve the EDI model by using multiple images and their events to handle flickering effects and noise in the generated video. We also provide a more efficient solver to minimize the proposed energy model.
Last, the blurred image and events also contribute to optical flow estimation. We propose a single image and events based optical flow estimation approach to unlock their potential applications.
In summary, this thesis addresses how to recover sharp images from blurred ones and how to reconstruct a high temporal resolution video from a single image and its events. Our extensive experimental results demonstrate that our proposed methods outperform the state-of-the-art
LDP: Language-driven Dual-Pixel Image Defocus Deblurring Network
Recovering sharp images from dual-pixel (DP) pairs with disparity-dependent
blur is a challenging task. Existing blur map-based deblurring methods have
demonstrated promising results. In this paper, we propose, to the best of our
knowledge, the first framework to introduce the contrastive language-image
pre-training framework (CLIP) to achieve accurate blur map estimation from DP
pairs in an unsupervised manner. To this end, we first carefully design text prompts to
enable CLIP to understand blur-related geometric prior knowledge from the DP
pair. Then, we propose a format to input a stereo DP pair to CLIP without any
fine-tuning, where CLIP is pre-trained on monocular images. Given the
estimated blur map, we introduce a blur-prior attention block, a blur-weighting
loss and a blur-aware loss to recover the all-in-focus image. Our method
achieves state-of-the-art performance in extensive experiments
LCCo: Lending CLIP to Co-Segmentation
This paper studies co-segmenting the common semantic object in a set of
images. Existing works either rely on carefully engineered networks to mine the
implicit semantic information in visual features or require extra data (i.e.,
classification labels) for training. In this paper, we leverage the contrastive
language-image pre-training framework (CLIP) for the task. With a backbone
segmentation network that independently processes each image from the set, we
introduce semantics from CLIP into the backbone features, refining them in a
coarse-to-fine manner with three key modules: i) an image set feature
correspondence module, encoding global consistent semantic information of the
image set; ii) a CLIP interaction module, using CLIP-mined common semantics of
the image set to refine the backbone feature; iii) a CLIP regularization
module, drawing CLIP towards this co-segmentation task, identifying the best
CLIP semantic and using it to regularize the backbone feature. Experiments on
four standard co-segmentation benchmark datasets show that our method
outperforms state-of-the-art methods
L2T-DLN: Learning to Teach with Dynamic Loss Network
With the concept of teaching being introduced to the machine learning
community, teacher models have started to use dynamic loss functions to guide the
training of student models. The dynamic loss is intended to provide adaptive loss
functions for different phases of student model learning. In existing works, the teacher
model 1) merely determines the loss function based on the present states of the
student model, i.e., disregards the experience of the teacher; 2) only utilizes
the states of the student model, e.g., training iteration number and
loss/accuracy from training/validation sets, while ignoring the states of the
loss function. In this paper, we first formulate the loss adjustment as a
temporal task by designing a teacher model with memory units, which
enables student learning to be guided by the experience of the teacher
model. Then, with a dynamic loss network, we can additionally use the states of
the loss to assist the teacher learning in enhancing the interactions between
the teacher and the student model. Extensive experiments demonstrate our
approach can enhance student learning and improve the performance of various
deep models on real-world tasks, including classification, object detection,
and semantic segmentation scenarios